RFC: Add TensorForest classifier and regressor to canned estimators #3
Conversation
Adding the overview information. This review will remain open for comment until the end of Monday, July 16th (allowing for public holidays).

TensorForest Estimator

Objective: In this doc, we discuss the TensorForest Estimator API, which enables a user to create …
martinwicke left a comment
My main comment would be that we should start with the minimum set of parameters that gives users the flexibility they need. It seems some of the parameters are not that useful; could we remove them to make the API simpler?
Some questions:
- Could we have benchmarks for this?
- Could you discuss whether there are efficiencies to be had for whole-batch training? We spent a lot of time on such questions for the boosted tree Estimator, and I don't think we need to go into that much detail, but I would like to know whether there are obvious improvements we can make. Sometimes this type of thing can influence the API (e.g., by requiring a separate pretraining input or something).
rfcs/20180626-tensor-forest.md
Outdated
* **label_vocabulary:** A list of strings representing possible label values. If given, labels must be of string type and take values from `label_vocabulary`. If not given, labels are assumed to be already encoded: as integers or floats within [0, 1] for `n_classes=2`, and as integer values in {0, 1, ..., n_classes-1} for `n_classes>2`. An error is raised if the vocabulary is not provided and labels are strings.
* **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values.
* **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond `max_nodes` nodes, and training stops when all trees in the forest are this large.
* **num_splits_to_consider:** Defaults to `sqrt(num_features)` capped to be between 10 and 1000. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node.
Are the 10 and 1000 boundaries universally accepted?
Nit: I would say "clipped"; to my ear, "capped" only works for the upper bound.
For now I just borrowed this from the original contrib implementation. It is not universal, though; not sure why the original author implemented it this way.
Not really; sklearn's ExtraTree uses `sqrt(num_features)` as the default.
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L1192-L1202
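For reference, a minimal scikit-learn equivalent of that default (note that scikit-learn applies no lower or upper clipping to the split-candidate count; in older releases this default was spelled `"auto"`):

```python
from sklearn.ensemble import ExtraTreesClassifier

# scikit-learn evaluates sqrt(n_features) candidate features per split
# for classification, with no lower/upper bounds like 10 and 1000.
clf = ExtraTreesClassifier(n_estimators=100, max_features="sqrt")
```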
So maybe I should remove this clip heuristic?
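For concreteness, here is a minimal sketch of how the parameters quoted above might be passed to the proposed estimator. The class name comes from this RFC, but the exact constructor signature shown here is an assumption, not a shipped TensorFlow symbol:

```python
import tensorflow as tf

# Hypothetical usage of the proposed canned estimator; parameter names
# and defaults are taken from the RFC text above.
feature_columns = [tf.feature_column.numeric_column("x", shape=[54])]

classifier = TensorForestClassifier(  # proposed; not yet in tf.estimator
    feature_columns=feature_columns,
    n_classes=7,
    label_vocabulary=None,        # labels already integer-encoded
    n_trees=100,                  # default; more trees rarely help
    max_nodes=10000,              # per-tree growth limit
    num_splits_to_consider=None,  # default: sqrt(num_features)
    split_after_samples=250,      # samples a leaf sees before splitting
)
```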
rfcs/20180626-tensor-forest.md
Outdated
* **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond `max_nodes` nodes, and training stops when all trees in the forest are this large.
* **num_splits_to_consider:** Defaults to `sqrt(num_features)` capped to be between 10 and 1000. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node.
* **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples.
* **bagging_fraction:** If less than 1.0, then each tree sees only a different, randomly sampled (without replacement), `bagging_fraction`-sized subset of the training data. Defaults to 1.0 (no bagging) because it has not given any accuracy improvement in our experiments so far.
If this gives no improvement, can we remove this argument?
We can always add stuff back, but we can never take it away (except at major versions) so we should be conservative in what we add.
For now I just borrowed this from the original contrib implementation, but thanks for your suggestion; I guess I can use the benchmark tool to find out whether the original claim is valid.
Since the original paper did have some numbers suggesting bootstrapping is not helping, I'll remove it from the API for now.
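For reference, these are the semantics the dropped `bagging_fraction` parameter would have had; the helper below is purely illustrative, not part of any API:

```python
import numpy as np

def bagging_indices(num_examples, bagging_fraction, rng):
    """Indices of the per-tree training subset, drawn without replacement."""
    subset_size = int(num_examples * bagging_fraction)
    return rng.choice(num_examples, size=subset_size, replace=False)

# Each tree would see its own random 80% slice of 1000 examples.
rng = np.random.default_rng(seed=0)
per_tree_indices = [bagging_indices(1000, 0.8, rng) for _ in range(10)]
```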
For the benchmark, yeah, we might use this: https://www.openml.org/search?q=ExtraTrees&type=flow

In this case, not really, since the trees we are using in TensorForest are Hoeffding Trees, which are incremental trees, so we don't require full-batch training.
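A hedged sketch of what this enables: training can stream mini-batches through a standard Estimator `input_fn`, with no full-batch pass. `classifier` refers to the hypothetical estimator sketched earlier:

```python
import numpy as np
import tensorflow as tf

def input_fn():
    # Synthetic stand-in data; real code would read from files.
    features = {"x": np.random.rand(10000, 54).astype(np.float32)}
    labels = np.random.randint(0, 7, size=10000)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset.shuffle(1024).batch(256)  # small batches, streamed

# The Hoeffding trees update per batch, so this never materializes the
# whole training set at once:
# classifier.train(input_fn=input_fn, max_steps=1000)
```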
@nataliaponomareva @martinwicke I think we're good to merge this now. Waiting for your LGTM and I'll merge.
Can we reflect the discussion notes somewhere here? Could be as a link to a doc, even in the comment thread. I just don't want them lost. @tanzhenyu
Agreed; previously we've linked them at the bottom of the RFC. Either that or including them in this PR thread would also work, and we'll link the PR discussion at the bottom of the RFC.
Talked with Edd offline; he will post it within the ready-to-push RFC.
Notes from the review committee meeting on 2018-08-07:

- Simplified code with only a limited subset of features (obviously, excluding all the experimental ones)
- New estimator interface, support for new feature columns and losses
Can you also copy this over from the doc:

"We will try to reuse as much code from canned boosted trees as possible (proto, inference, etc.)."
rfcs/20180626-tensor-forest.md
Outdated
### Interface

### TensorForestClassifier

    ```
use ```python to get the syntax highlighting?
rfcs/20180626-tensor-forest.md
Outdated
### TensorForestRegressor

    ```
use ```python to get the syntax highlighting?
rfcs/20180626-tensor-forest.md
Outdated
4. Otherwise, `(x_i, y_i)` is used to update the statistics of every split in the growing statistics of leaf `l_i`. If leaf `l_i` has now seen `split_after_samples` data points since creating all of its potential splits, the split with the best score is chosen, and the tree structure is grown.
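A hedged, illustrative sketch of this update rule; every name below is made up for exposition and is not the RFC's internal API:

```python
def update_leaf(leaf, x_i, y_i, split_after_samples=250):
    # Accumulate statistics for each of the leaf's candidate splits.
    for split in leaf.candidate_splits:
        split.update_statistics(x_i, y_i)
    leaf.num_samples_seen += 1
    # Once enough samples have been seen, commit the best-scoring split,
    # turning this leaf into an internal node and growing the tree.
    if leaf.num_samples_seen >= split_after_samples:
        best = max(leaf.candidate_splits, key=lambda s: s.score())
        leaf.convert_to_internal_node(best)
```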
## BenchMark
Benchmark
rfcs/20180626-tensor-forest.md
Outdated
|Covertype| 581k| 54| 7| 83.0| 85.0|
|HiGGS| 11M| 28| 2| 70.9| 71.7|

With single-machine training, TensorForest finishes much faster on a big dataset like HIGGS, taking about one percent of the time scikit-learn requires.
Hm, where is time in this table? It is just performance metrics, right?
Yeah, it's just performance metrics; I took it from the workshop paper.
But what I am saying is that the statement "much faster on big datasets" is not substantiated by this table. Either keep the table, say that it is from resource A demonstrating that the quality is on par with scikit-learn, and remove the statement that it trains faster; or add a reference to the resource which states that it trains faster.
I see. I'll add a citation to the workshop paper, as the claim was also from the paper.
Hi, may I know whether the TensorForest package is still supported under TensorFlow 2.0.x? Many thanks.
Since tree algorithms are among the most popular algorithms used in Kaggle competitions, and we already have a contrib project, tensor_forest, that people like, it would be beneficial to move them into canned estimators.
cc: @nataliaponomareva